Add structured logging and feature output diagnostics to FeatureBuilder #370

Open
sundy1994 wants to merge 11 commits into dev from log_file

Conversation

@sundy1994
Collaborator

Summary

  • Introduces a reusable setup_logger utility in src/team_comm_tools/utils/preprocess.py that writes timestamped logs to ./<output_file_base>/logs/, auto-creates the directory, and guards against duplicate handlers / propagation to root.
  • Wires two loggers into FeatureBuilder (feature_builder.log for top-level run info and summary_details.log for verbose per-column output) and threads the logger through ChatLevelFeaturesCalculator, UserLevelFeaturesCalculator, ConversationLevelFeaturesCalculator, and check_embeddings. print and warnings.warn calls for errors / invalid configs are now mirrored to the log.
  • Adds perf_counter-based timings around each feature method (chat / user / conversation level) and around sentence-vector and BERT generation in check_embeddings.py, so the log captures per-step durations.
  • Adds a post-featurization diagnostics step (generate_summary_stats) that, for each output level, reports columns with high NA ratios, high zero ratios, and groups of highly correlated columns (Spearman, configurable threshold). New constructor params: corr_thresh, min_na_ratio, min_zero_ratio, min_group_size, treat_zero_as_na, drop_redundant_columns. With drop_redundant_columns=True, columns whose NA or zero ratios exceed the thresholds are dropped. In addition, only one representative per correlated group is kept (chosen by valid-data count and variance); the others in the group are dropped.
  • Logs the run header (timestamp, dataset shape — lines / unique speakers / unique conversations) at the start of featurize.
  • Cleans up imports in check_embeddings.py (drops top-level torch / unused util in favor of narrower from torch import cuda, no_grad).
  • Updates and reorganizes docstrings in feature_builder.py.
  • Adds *.csv and *.log to .gitignore.
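
For reviewers who want a concrete picture of the logger behavior described in the first bullet, here is a minimal sketch. It is not the code in this PR: the function name make_file_logger, its parameters, and the log format are hypothetical, and "timestamped" is assumed here to mean a timestamp on each record. It only illustrates the three properties named above: the logs directory is created on demand, repeated calls do not stack handlers, and records do not propagate to the root logger.

```python
import logging
import tempfile
from pathlib import Path


def make_file_logger(name, output_file_base, filename):
    """Hypothetical stand-in for setup_logger; illustrative only."""
    log_dir = Path(output_file_base) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)  # auto-create ./<base>/logs/

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger

    # Attach a handler only once so repeated calls do not duplicate lines.
    if not logger.handlers:
        handler = logging.FileHandler(log_dir / filename, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(levelname)s | %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Usage: the second call returns the same logger without adding a handler.
demo_base = tempfile.mkdtemp()
demo_log = make_file_logger("demo_run", demo_base, "feature_builder.log")
demo_log = make_file_logger("demo_run", demo_base, "feature_builder.log")
demo_log.info("run started")
```

The handler check is the simplest guard against duplicate output; a stricter version could compare handler file paths, which matters if the same logger name is reused for a different output directory.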

Behavior change to flag for review

As discussed, the first x percent feature is deprecated. analyze_first_pct and get_first_pct_of_chat are commented out in feature_builder.py, removing the multi-truncation loop in featurize.

@sundy1994 sundy1994 requested a review from xehu May 7, 2026 18:33
xehu added 2 commits May 31, 2026 21:46
…ver the FB-generated cols (not all cols/metadata); (2) edit the log headers to specify what file the log is for

Copilot AI left a comment

Pull request overview

This PR adds structured logging and performance/diagnostic output to the FeatureBuilder pipeline, threads a logger through the feature calculators and embedding generation, and introduces an opt-in post-processing step to identify and optionally drop sparse/redundant feature columns. It also updates documentation to reflect the new diagnostics and the deprecation/removal of analyze_first_pct.
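
The per-step timing mentioned above can be pictured with a small sketch. This is an assumption about the general pattern, not the PR's code: the helper log_duration and the label text are hypothetical, and the PR may well call perf_counter inline rather than through a context manager.

```python
import logging
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def log_duration(logger, label):
    """Log how long the wrapped block took (hypothetical helper)."""
    start = perf_counter()
    try:
        yield
    finally:
        # Runs even if the wrapped step raises, so failures are timed too.
        logger.info("%s took %.3f s", label, perf_counter() - start)


# Usage: wrap one feature step and let the logger record its duration.
logging.basicConfig(level=logging.INFO)
timing_logger = logging.getLogger("timing_demo")
with log_duration(timing_logger, "toy_feature"):
    total = sum(i * i for i in range(10_000))
```

perf_counter is a monotonic, high-resolution clock, so it suits measuring elapsed durations better than wall-clock time.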

Changes:

  • Add setup_logger utility and wire structured loggers through FeatureBuilder, calculators, and embedding generation (plus perf timings).
  • Add post-featurization redundancy diagnostics (generate_summary_stats) with opt-in column dropping (drop_redundant_columns and related thresholds).
  • Update docs/examples and rebuild HTML docs artifacts; extend .gitignore for logs/CSVs.
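
To make the redundancy diagnostics above easier to reason about, here is a toy sketch of the idea using pandas. It is not generate_summary_stats: the function name summarize_columns and the default threshold values are hypothetical, and several details are assumptions (absolute Spearman correlation, ">=" comparisons, grouping by connected components, and ranking representatives by non-missing count first and variance second).

```python
import numpy as np
import pandas as pd


def summarize_columns(df, corr_thresh=0.9, min_na_ratio=0.5,
                      min_zero_ratio=0.5, min_group_size=2):
    """Toy diagnostics; thresholds and tie-breaking are assumptions."""
    numeric = df.select_dtypes(include="number")
    na_ratio = numeric.isna().mean()
    zero_ratio = (numeric == 0).mean()
    high_na = na_ratio[na_ratio >= min_na_ratio].index.tolist()
    high_zero = zero_ratio[zero_ratio >= min_zero_ratio].index.tolist()

    # Treat strong correlations as graph edges; groups are connected components.
    corr = numeric.corr(method="spearman").abs()
    cols = list(corr.columns)
    seen, groups = set(), []
    for col in cols:
        if col in seen:
            continue
        stack, group = [col], []
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            group.append(cur)
            stack.extend(c for c in cols
                         if c not in seen and corr.loc[cur, c] >= corr_thresh)
        if len(group) >= min_group_size:
            groups.append(sorted(group))

    def rank_key(c):
        var = numeric[c].var()
        return (numeric[c].notna().sum(), 0.0 if np.isnan(var) else var)

    # One representative per group: most valid values, then highest variance.
    keep = [max(g, key=rank_key) for g in groups]
    return {"high_na": high_na, "high_zero": high_zero,
            "groups": groups, "keep": keep}


demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, np.nan],  # monotone in "a", one missing value
    "c": [0.0, 0.0, 0.0, 1.0, 0.0],     # mostly zeros
    "d": [5.0, 1.0, 4.0, 2.0, 3.0],     # unrelated to the others
})
report = summarize_columns(demo)
print(report)
```

In this toy data, "a" and "b" form the only correlated group and "a" is kept because it has more valid values, while "c" is flagged for its zero ratio. A constant column yields NaN correlations, which fail the threshold comparison and so never join a group.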

Reviewed changes

Copilot reviewed 66 out of 69 changed files in this pull request and generated 6 comments.

File Description
src/team_comm_tools/utils/preprocess.py Adds reusable setup_logger helper to create log dirs and avoid duplicate handlers.
src/team_comm_tools/utils/check_embeddings.py Adds perf timing + logger usage; narrows torch imports; updates function signature.
src/team_comm_tools/utils/calculate_user_level_features.py Adds perf timings for major user-level feature steps and logs durations.
src/team_comm_tools/utils/calculate_conversation_level_features.py Adds per-feature-method perf timings and logs durations.
src/team_comm_tools/utils/calculate_chat_level_features.py Adds per-method perf timings and logs durations; extends constructor to accept a logger.
src/team_comm_tools/feature_builder.py Sets up loggers, adds run header + timings, removes first-% loop, adds redundancy diagnostics and optional column dropping.
docs/source/examples.rst Documents drop_redundant_columns and marks first-% analysis as deprecated/removed.
docs/source/basics.rst Updates customizable-parameters list and adds redundancy-reduction section.
docs/build/html/utils/zscore_chats_and_conversation.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/utils/summarize_features.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/utils/preprocess.html Rebuilt HTML docs artifact; includes setup_logger in docs nav.
docs/build/html/utils/preload_word_lists.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/utils/index.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/utils/gini_coefficient.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/utils/check_embeddings.html Rebuilt HTML docs artifact reflecting signature/logger updates.
docs/build/html/utils/calculate_user_level_features.html Rebuilt HTML docs artifact reflecting logger parameter.
docs/build/html/utils/calculate_conversation_level_features.html Rebuilt HTML docs artifact reflecting logger parameter.
docs/build/html/utils/calculate_chat_level_features.html Rebuilt HTML docs artifact reflecting logger parameter.
docs/build/html/utils/assign_chunk_nums.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/search.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/py-modindex.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/intro.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/index.html Rebuilt HTML docs artifact (formatting/pygments whitespace changes).
docs/build/html/genindex.html Rebuilt HTML docs artifact; index now includes new FeatureBuilder methods.
docs/build/html/features/word_mimicry.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/within_person_discursive_range.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/variance_in_DD.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/turn_taking_features.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/textblob_sentiment_analysis.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/temporal_features.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/reddit_tags.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/readability.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/question_num.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/politeness_v2.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/politeness_v2_helper.html Rebuilt HTML docs artifact; docstring formatting/structure updates.
docs/build/html/features/politeness_features.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/other_lexical_features.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/named_entity_recognition_features.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/lexical_features_v2.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/information_diversity.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/info_exchange_zscore.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/index.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/hedge.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/get_user_network.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/get_all_DD_features.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/fflow.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/discursive_diversity.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/certainty.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/burstiness.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features/basic_features.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features_conceptual/word_ttr.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features_conceptual/turn_taking_index.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features_conceptual/TEMPLATE.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features_conceptual/positivity_bert.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features_conceptual/named_entity_recognition.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features_conceptual/moving_mimicry.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features_conceptual/mimicry_bert.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features_conceptual/index.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features_conceptual/function_word_accommodation.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/features_conceptual/content_word_accommodation.html Rebuilt HTML docs artifact (static asset hash changes).
docs/build/html/feature_builder.html Rebuilt HTML docs artifact reflecting new params/methods and removal of first-% method.
docs/build/html/examples.html Rebuilt HTML docs artifact with redundancy section and first-% deprecation.
docs/build/html/.buildinfo Rebuilt Sphinx build metadata.
docs/build/html/_static/searchtools.js Updates Sphinx search JS (includes a selector change needing correction).
docs/build/html/_static/pygments.css Rebuilt pygments CSS (color normalization/formatting changes).
docs/build/html/_sources/examples.rst.txt Rebuilt RST source artifact reflecting docs/source/examples.rst changes.
.gitignore Ignores .claude/, and adds *.csv/*.log patterns.

Comment thread src/team_comm_tools/feature_builder.py Outdated
Comment thread src/team_comm_tools/utils/calculate_chat_level_features.py
Comment thread src/team_comm_tools/utils/check_embeddings.py Outdated
Comment thread src/team_comm_tools/utils/check_embeddings.py Outdated
Comment thread docs/build/html/_static/searchtools.js Outdated
Comment thread src/team_comm_tools/feature_builder.py
xehu and others added 4 commits June 1, 2026 01:20
The test_drop_redundant_columns.py tests (committed in 93b074e) read this
CSV, but it was silently skipped by the *.csv .gitignore rule, leaving the
test without its fixture on the remote. Force-add it so the test is runnable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The blanket *.csv ignore rule silently dropped new test fixtures (e.g.
test_redundant_columns.csv), requiring git add -f. Negate the rule for the
fixture directory so fixtures stage normally; CSVs elsewhere stay ignored.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@xehu
Collaborator

xehu commented Jun 1, 2026

Made a few updates:

  • added the name of the file being featurized to the logger header
  • ensured that columns that were originally metadata do not get flagged in the correlated groups; the correlated groups now cover only the features/columns output by our FeatureBuilder
  • updated the documentation to cover the deprecation and the correlation feature, with examples
  • added tests
  • qualified the .gitignore rule for .csv files so that it does not ignore test fixtures

@xehu xehu self-assigned this Jun 1, 2026
The test workflow invokes pytest file-by-file, so test_drop_redundant_columns.py
was never collected. Add it to the Run tests step.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
3 participants